
    Efficient and Noise-Tolerant Reinforcement Learning Algorithms via Theoretical Analysis of Gap-Increasing and Softmax Operators

    Model-free deep Reinforcement Learning (RL) algorithms, which combine deep learning with model-free RL, have attained remarkable successes in solving complex tasks such as video games. However, theoretical analyses and recent empirical results indicate their proneness to various types of value update errors, including but not limited to estimation error due to finite samples and function approximation error. Because real-world tasks are inherently complex and stochastic, such errors are inevitable, and thus the development of error-tolerant RL algorithms is of great importance for applying RL to real problems. To this end, I propose two error-tolerant algorithms for RL: Conservative Value Iteration (CVI) and Gap-increasing RetrAce for Policy Evaluation (GRAPE). CVI unifies value-iteration-like single-stage-lookahead algorithms such as soft value iteration, advantage learning and K-learning, all of which are characterized by the use of a gap-increasing operator and/or softmax operator in value updates. We provide a detailed theoretical analysis of CVI that not only shows CVI's advantages but also contributes to the theory of RL in two ways: first, it elucidates the pros and cons of gap-increasing and softmax operators; second, it provides a concrete example in which algorithms with the max operator perform worse than algorithms with the softmax operator, demonstrating the limitation of traditional greedy value updates. GRAPE is a policy evaluation algorithm extending advantage learning (AL) and Retrace, each of which has a different advantage: AL is noise-tolerant, as shown through our theoretical analysis of CVI, while Retrace is efficient in that it is off-policy and allows control of the bias-variance trade-off. Theoretical analysis of GRAPE shows that it enjoys the merits of both algorithms. In experiments, we demonstrate the benefit of GRAPE combined with a variant of trust region policy optimization and its superiority to previous algorithms. Through these studies, I theoretically elucidated the benefits of gap-increasing and softmax operators in both policy evaluation and control settings. While some open problems remain, as explained in the final chapter, the results presented in this thesis are an important step towards a deep understanding of RL algorithms.
    Okinawa Institute of Science and Technology Graduate University
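
    To make the operators concrete, here is a minimal Python sketch (not taken from the thesis) contrasting the standard max backup with a softmax (log-sum-exp) backup and an advantage-learning-style gap-increasing update; the inverse temperature beta and gap coefficient alpha are illustrative hyperparameters.

```python
import numpy as np

def max_backup(q_values):
    # Standard greedy backup: take the maximum action value.
    return np.max(q_values)

def softmax_backup(q_values, beta=1.0):
    # Softmax (log-sum-exp) backup with inverse temperature beta;
    # it approaches the max backup as beta grows large.
    return np.log(np.sum(np.exp(beta * q_values))) / beta

def gap_increasing_update(q_values, action, backup_target, alpha=0.9):
    # Advantage-learning-style update: adds alpha times the (non-positive)
    # action gap, pushing suboptimal action values further below the best
    # one and making the greedy policy more robust to value noise.
    gap = q_values[action] - np.max(q_values)
    return backup_target + alpha * gap

# Example: Q-values at a single state.
q = np.array([1.0, 0.8, 0.2])
print(max_backup(q), softmax_backup(q, beta=5.0))
print(gap_increasing_update(q, action=1, backup_target=1.0))
```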

    When to Replan? An Adaptive Replanning Strategy for Autonomous Navigation using Deep Reinforcement Learning

    The hierarchy of global and local planners is one of the most commonly used system designs in autonomous robot navigation. While the global planner generates a reference path from the current location to the goal based on the pre-built static map, the local planner produces a kinodynamic trajectory that follows the reference path while avoiding perceived obstacles. To account for unforeseen or dynamic obstacles not present in the pre-built map, "when to replan" the reference path is critical for safe and efficient navigation. However, determining the ideal timing to execute replanning in such partially unknown environments remains an open question. In this work, we first conduct an extensive simulation experiment to compare several common replanning strategies and confirm that effective strategies are highly dependent on the environment as well as on the global and local planners. Based on this insight, we derive a new adaptive replanning strategy based on deep reinforcement learning, which can learn from experience to decide appropriate replanning timings in the given environment and planning setup. Our experimental results demonstrate that the proposed replanner performs on par with or even better than the current best-performing strategies in terms of navigation robustness and efficiency.
    Comment: 7 pages, 3 figures
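
    As a rough sketch of the planner hierarchy with a learned replanning trigger (not the authors' implementation), the loop below assumes hypothetical GlobalPlanner, LocalPlanner, and replan_policy interfaces.

```python
def navigate(robot, goal, global_planner, local_planner, replan_policy):
    """Hypothetical navigation loop with a learned replanning trigger."""
    reference_path = global_planner.plan(robot.pose(), goal)
    while not robot.at(goal):
        observation = robot.sense()  # local costmap, perceived obstacles, etc.
        # The learned policy maps the current observation (and progress along
        # the reference path) to a binary decision: replan now or keep going.
        if replan_policy.should_replan(observation, reference_path):
            reference_path = global_planner.plan(robot.pose(), goal)
        # The local planner produces a kinodynamic trajectory that follows the
        # reference path while avoiding obstacles absent from the static map.
        trajectory = local_planner.plan(reference_path, observation)
        robot.execute(trajectory)
```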

    Benchmarking Actor-Critic Deep Reinforcement Learning Algorithms for Robotics Control with Action Constraints

    This study presents a benchmark for evaluating action-constrained reinforcement learning (RL) algorithms. In action-constrained RL, each action taken by the learning system must comply with certain constraints. These constraints are crucial for ensuring the feasibility and safety of actions in real-world systems. We evaluate existing algorithms and their novel variants across multiple robotics control environments, encompassing multiple types of action constraints. Our evaluation provides the first in-depth perspective on the field, revealing surprising insights, including the effectiveness of a straightforward baseline approach. The benchmark problems and associated code used in our experiments are available online at github.com/omron-sinicx/action-constrained-RL-benchmark for further research and development.
    Comment: 8 pages, 7 figures, submitted to Robotics and Automation Letters
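
    For intuition only (not the benchmark's code), a simple way to enforce action constraints at execution time is to project the policy's raw action onto the feasible set; the sketch below illustrates projection onto a box and onto an L2-norm ball, two generic constraint types.

```python
import numpy as np

def project_box(action, low, high):
    # Clip each action dimension to its allowed interval.
    return np.clip(action, low, high)

def project_l2_ball(action, max_norm):
    # Scale the action back onto the L2 ball if it exceeds the norm limit,
    # e.g. a bound on total joint velocity or power.
    norm = np.linalg.norm(action)
    return action if norm <= max_norm else action * (max_norm / norm)

raw_action = np.array([0.8, -1.5, 0.3])
print(project_box(raw_action, low=-1.0, high=1.0))
print(project_l2_ball(raw_action, max_norm=1.0))
```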

    Adapting to game trees in zero-sum imperfect information games

    Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $\epsilon$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\mathcal{O}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/\epsilon^2)$ on the required number of realizations to learn these strategies with high probability, where $H$ is the length of the game and $A_{\mathcal{X}}$ and $B_{\mathcal{Y}}$ are the total numbers of actions for the two players. We also propose two Follow the Regularized Leader (FTRL) algorithms for this setting: Balanced-FTRL, which matches this lower bound but requires knowledge of the information set structure beforehand to define the regularization; and Adaptive-FTRL, which needs $\mathcal{O}(H^2(A_{\mathcal{X}}+B_{\mathcal{Y}})/\epsilon^2)$ plays without this requirement by progressively adapting the regularization to the observations.
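
    As a generic illustration of the FTRL template (not the paper's Balanced-FTRL or Adaptive-FTRL), the snippet below computes the next local strategy at a single information set from cumulative estimated losses under entropy regularization, which has a closed-form softmax solution.

```python
import numpy as np

def ftrl_entropy_step(cumulative_losses, eta):
    # Follow the Regularized Leader with an entropy regularizer:
    # play argmin_p <p, L_t> + (1/eta) * sum_a p_a log p_a over the simplex,
    # whose closed form is a softmax of -eta * cumulative losses.
    logits = -eta * cumulative_losses
    logits -= logits.max()            # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Example: three actions at one information set, learning rate eta.
L = np.array([2.0, 1.0, 3.5])         # estimated cumulative losses
print(ftrl_entropy_step(L, eta=0.5))  # next (local) strategy
```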

    Local and adaptive mirror descents in extensive-form games

    We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes, denoted by $T$. Existing procedures suffer from high variance due to the use of importance sampling over sequences of actions (Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we consider a fixed sampling approach, where players still update their policies over time, but with observations obtained through a given fixed sampling policy. Our approach is based on an adaptive Online Mirror Descent (OMD) algorithm that applies OMD locally to each information set, using individually decreasing learning rates and a regularized loss. We show that this approach guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high probability and has a near-optimal dependence on the game parameters when applied with the best theoretical choices of learning rates and sampling policies. To achieve these results, we generalize the notion of OMD stabilization, allowing for time-varying regularization with convex increments.
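
    As a minimal sketch of a local mirror-descent step of this kind (not the authors' exact procedure), the update below applies an entropic OMD / multiplicative-weights step at one information set with a decreasing learning rate; the schedule eta0/sqrt(t) is an illustrative assumption.

```python
import numpy as np

def local_omd_step(policy, loss_estimate, t, eta0=1.0):
    # Entropic (multiplicative-weights) mirror-descent step at a single
    # information set, with a learning rate that decreases over time.
    eta_t = eta0 / np.sqrt(t)
    weights = policy * np.exp(-eta_t * loss_estimate)
    return weights / weights.sum()

# Example: update a local strategy over a few episodes.
p = np.ones(3) / 3
for t in range(1, 4):
    loss = np.random.rand(3)   # stand-in for the estimated local loss
    p = local_omd_step(p, loss, t)
print(p)
```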